92 research outputs found
Une approche par boosting à la sélection de modÚles pour l'analyse syntaxique statistique (A boosting approach to model selection for statistical parsing)
In this work we present our approach to model selection for statistical parsing via boosting. The method targets the inefficiency of current feature selection methods: it allows a constant feature selection time at each iteration, rather than the increasing selection time of standard forward wrapper methods. With the aim of performing feature selection on very high-dimensional data, in particular for parsing morphologically rich languages, we test the approach, which uses the multiclass AdaBoost algorithm SAMME (Zhu et al., 2006), on French data from the French Treebank, using a multilingual discriminative constituency parser (Crabbé, 2014). Current results show that the method is indeed far more efficient than a naïve method, and the performance of the models produced is promising, with F-scores comparable to carefully selected manual models. We provide some perspectives for improving on these performances in future work.
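The SAMME algorithm named above differs from binary AdaBoost only by an extra log(K-1) term in the learner weight, which is what makes it usable with more than two classes. The following is a minimal sketch over decision stumps, assuming scikit-learn is available; the function names are illustrative, and this generic boosting loop is not the parser-specific feature-selection wrapper the paper builds on top of it:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, n_rounds=10):
    """Minimal SAMME (multiclass AdaBoost) over depth-1 decision stumps."""
    n, K = len(y), len(np.unique(y))
    w = np.full(n, 1.0 / n)                      # uniform example weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = w[miss].sum() / w.sum()
        if err >= 1.0 - 1.0 / K:                 # no better than random guessing
            break
        # SAMME learner weight: binary AdaBoost plus a log(K - 1) correction
        alpha = np.log((1 - err) / max(err, 1e-12)) + np.log(K - 1)
        w *= np.exp(alpha * miss)                # up-weight misclassified examples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def samme_predict(X, learners, alphas, K):
    """Weighted vote over the boosted weak learners."""
    votes = np.zeros((len(X), K))
    for stump, a in zip(learners, alphas):
        votes[np.arange(len(X)), stump.predict(X)] += a
    return votes.argmax(axis=1)
```

In the paper's setting, each boosting round instead evaluates candidate feature templates for the parser, which is what keeps the per-iteration selection cost constant rather than growing as in forward wrapper methods.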
Document Sub-structure in Neural Machine Translation
Current approaches to machine translation (MT) either translate sentences in
isolation, disregarding the context they appear in, or model context at the
level of the full document, without a notion of any internal structure the
document may have. In this work we consider the fact that documents are rarely
homogeneous blocks of text, but rather consist of parts covering different
topics. Some documents, such as biographies and encyclopedia entries, have
highly predictable, regular structures in which sections are characterised by
different topics. We draw inspiration from Louis and Webber (2014) who use this
information to improve statistical MT and transfer their proposal into the
framework of neural MT. We compare two different methods of including
information about the topic of the section within which each sentence is found:
one using side constraints and the other using a cache-based model. We create
and release the data on which we run our experiments - parallel corpora for
three language pairs (Chinese-English, French-English, Bulgarian-English) from
Wikipedia biographies, which we extract automatically, preserving the
boundaries of sections within the articles.
Comment: Accepted at LREC 202
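The side-constraint method can be sketched as simply prepending a pseudo-token encoding the section topic to each source sentence, so the encoder can condition on it without any architectural change; the tag format and function name below are illustrative, not the paper's actual vocabulary:

```python
def add_topic_constraint(source_tokens, topic):
    """Prepend a side-constraint pseudo-token encoding the section topic.

    The NMT model treats the tag as an ordinary vocabulary item, so no
    change to the architecture is needed; during training the model learns
    to associate the tag with topic-specific target-side vocabulary.
    """
    return [f"<topic:{topic}>"] + list(source_tokens)
```

The cache-based alternative mentioned above instead maintains a store of topic-typical target words that biases the decoder's output distribution at each step.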
Few-shot learning through contextual data augmentation
Machine translation (MT) models used in industries with constantly changing
topics, such as translation or news agencies, need to adapt to new data to
maintain their performance over time. Our aim is to teach a pre-trained MT
model to translate previously unseen words accurately, based on very few
examples. We propose (i) an experimental setup allowing us to simulate novel
vocabulary appearing in human-submitted translations, and (ii) corresponding
evaluation metrics to compare our approaches. We extend a data augmentation
approach using a pre-trained language model to create training examples with
similar contexts for novel words. We compare different fine-tuning and data
augmentation approaches and show that adaptation on the scale of one to five
examples is possible. Combining data augmentation with randomly selected
training sentences leads to the highest BLEU score and accuracy improvements.
Impressively, with only 1 to 5 examples, our model reports better accuracy
scores than a reference system trained with on average 313 parallel examples.
Comment: 14 pages including 3 of appendices
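A much-simplified stand-in for this augmentation idea is to substitute the novel word into existing training sentences that contain a distributionally similar word; in the paper the similar contexts come from a pre-trained language model rather than this naive substitution, and all names below are illustrative:

```python
import random

def augment_novel_word(novel_word, similar_word, corpus, n=5):
    """Create synthetic training sentences for `novel_word` by swapping it
    into sentences that contain a similar word (a simplified stand-in for
    LM-generated contexts)."""
    hits = [s for s in corpus if similar_word in s.split()]
    sampled = random.sample(hits, min(n, len(hits)))
    return [" ".join(novel_word if tok == similar_word else tok
                     for tok in s.split())
            for s in sampled]
```

Mixing such synthetic examples with randomly selected original training sentences mirrors the combination that the abstract reports as giving the largest BLEU and accuracy gains.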
Boosting for Efficient Model Selection for Syntactic Parsing
We present an efficient model selection method using boosting for transition-based constituency parsing. It is designed for exploring a high-dimensional search space defined by a large set of feature templates, as is typically the case when parsing morphologically rich languages. Our method removes the need to manually define heuristic constraints, which are often imposed in current state-of-the-art selection methods. Our experiments on French show that the method is more efficient and is also capable of producing compact, state-of-the-art models.
A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing
We investigate a long-perceived shortcoming in the typical use of BLEU: its
reliance on a single reference. Using modern neural paraphrasing techniques, we
study whether automatically generating additional diverse references can
provide better coverage of the space of valid translations and thereby improve
its correlation with human judgments. Our experiments on the into-English
language directions of the WMT19 metrics task (at both the system and sentence
level) show that using paraphrased references does generally improve BLEU, and
when it does, the more diverse the better. However, we also show that better
results could be achieved if those paraphrases were to specifically target the
parts of the space most relevant to the MT outputs being evaluated. Moreover,
the gains remain slight even when human paraphrases are used, suggesting
inherent limitations to BLEU's capacity to correctly exploit multiple
references. Surprisingly, we also find that adequacy appears to be less
important, as shown by the high results of a strong sampling approach, which
even beats human paraphrases when used with sentence-level BLEU.
Comment: Accepted in the Findings of EMNLP 202
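With multiple references, BLEU clips each hypothesis n-gram count against the element-wise maximum over all references, and the brevity penalty uses the reference length closest to the hypothesis. A minimal sentence-level sketch with add-one smoothing follows; the evaluations described above use standard tooling, so treat this as an illustration of the mechanism only:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def multi_ref_bleu(hyp, refs, max_n=4):
    """Sentence-level BLEU against multiple references (add-one smoothed)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = _ngrams(hyp, n)
        if not hyp_counts:               # hypothesis shorter than n tokens
            return 0.0
        clip = Counter()                 # element-wise max over references
        for ref in refs:
            for g, c in _ngrams(ref, n).items():
                clip[g] = max(clip[g], c)
        matched = sum(min(c, clip[g]) for g, c in hyp_counts.items())
        precisions.append((matched + 1) / (sum(hyp_counts.values()) + 1))
    # brevity penalty: use the reference length closest to the hypothesis
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Adding a reference can only grow the clipped counts, so extra paraphrased references never lower the score; the question the paper studies is whether they raise it in the ways that matter for correlation with human judgments.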
- 
